Abstract

A hierarchical structure describing the inter-relationships of species has long been a 
fundamental concept in systematic biology, from Linnean classification through to the 
more recent quest for a 'Tree of Life.' In this paper we use an approach based on 
discrete mathematics to address a basic question: Could one delineate this 
hierarchical structure in nature purely by reference to the 'genealogy' of present-day 
individuals, which describes how they are related with one another by ancestry 
through a continuous line of descent? We describe several mathematically precise 
ways by which one can naturally define collections of subsets of present-day 
individuals so that these subsets are nested (and so form a tree) based purely on the 
directed graph that describes the ancestry of these individuals. We also explore the 
relationship between these and related clustering constructions. 
1 Introduction



In this paper, we apply discrete mathematical arguments to study how hierarchical 
structures arise naturally from a very basic graph in systematic biology. 

Consider the collection of all organisms that ever lived on earth - this includes not just the 
set X of organisms alive at present, and other organisms we can directly observe (e.g. fossil 
specimens), but a much larger set V consisting of all organisms (or vertebrates or dicots or 
...) that ever lived on this planet. There is a very natural directed graph structure on V: 
place a directed arc from $u \in V$ to $v \in V$ if u was a 'parent' of v. Here, the word 'parent' 
means that u contributed directly to the genetic make-up of v; in a sexually-reproducing 
population, this is the usual meaning of the word (the two parents of v are the contributors 
of the sperm and egg), while in an asexually reproducing (haploid) population, each 
individual typically has one parent (e.g. the prokaryote cell whose division led to the new 
cell) though, occasionally, v may be regarded as having additional 'parents' beyond those 
described, as a result of processes such as lateral gene transfer (LGT) or other forms of 
reticulate evolution (e.g. a hybrid taxon). 

This graph - let us call it G - can thus be regarded as a 'history of life' network that 
describes how different past and present individual organisms are related to one another by 
ancestry (Steel, 2007). The graph G cannot be directly observed - we have access only to a 
subset X of 'observable' individuals along with some clues as to the gross structure of 
the rest of the graph gleaned from the genomic data of individuals in X, and other 
observable information (morphology, biochemistry, behavior, fossils etc). Nevertheless, the 
graph G is a well-defined entity, based on the premise that each organism has at least one 
parent, back to the earliest forms of life that existed on earth. 

Such a huge graph would not be of much interest were it not for Darwinian evolution. The 
idea that all life traces back to one common ancestor suggests that G is a connected graph, 
with the lines of descent of populations that we call 'species' merging (coalescing) as we 
trace their ancestry, from child to parent, backward in time. Thus, rather than being an 
isolated set of component graphs - one for each 'species' - the graph G is more like a very 
large, diffuse 'tree of populations' (see Fig. 1), where the populations occasionally split 
when a 'speciation event' occurs, for example when a population becomes separated into 
two reproductively isolated groups (a process referred to as allopatric speciation), though 
occasionally these lineages may later intersect, for example if hybrid species arise from two 
lineages. At the microbial level, with extensive LGT and occasional endosymbiotic events, 
this picture may appear more like a 'net of life' (Kunin et al., 2005). 



The history of populations is usually represented in systematic biology as a rooted 
phylogenetic tree - that is, a rooted tree where the leaves are labeled by the extant 'species' 
and which has edges and interior vertices that correspond to ancestral 'species' and 
'speciation events', respectively (Felsenstein, 2004; Semple and Steel, 2003). In this 
representation, the fine detail of the descent of a population through time is lost, creating 
an unfortunate separation between phylogenetics and population genetics. 



Figure 1: A simplified picture of a history of populations. In this example A, B and C form 
tight clusters. 



This high level picture of evolution via phylogenetic trees is problematic for two further 
reasons. Firstly, it requires one to address the much-debated notion of the nature and 
definition of 'species', a concept that is particularly ambiguous at the microorganism level 
(Doolittle, 1999; Wheeler and Meier, 2000). Secondly, it is increasingly being argued that 
processes of reticulate evolution such as LGT require that the evolution of 'species' should 
In this paper, we take a simple if somewhat novel approach to this issue by asking whether 
we can simply use G directly to define a tree (or tree-like structure) that reflects the 
bifurcating history of life studied in evolutionary theory, and which (i) does not require the 
prior identification or definition of 'species' and (ii) is robust to the many processes that 
can complicate a tree-like history, such as LGT. Viewing an evolutionary tree in this direct 
way is perhaps in the spirit of Darwin's suggestion to "discover and trace the many 
diverging lines of descent in our natural genealogies" (Darwin, 1872). Of course, the notion 
that there is a hierarchical structure to the life we see today is a concept that came well 
before Darwinian evolution; for example, Linnean classification (Linnaeus, 1735) dates back 
more than 100 years before Charles Darwin's On the Origin of Species by Means of Natural 
Selection, or the Preservation of Favoured Races in the Struggle for Life appeared. 
Moreover, the nature of 'species' has been discussed much earlier - from Plato through to 
the 17th Century English naturalist John Ray. 



In this paper, we do not provide any general procedure for constructing hierarchies from 
genomic data; our interest here is purely in addressing the more fundamental questions: 



• Can we construct from G systems of clusters (subsets of X) that reflect complex 
ancestral relationships and yet behave in a nested (tree-like) fashion? 

• What are the properties of, and relationships between, different possible 
constructions? 

• What assumptions, if any, concerning evolution are required so that the clusters 
derived from G are guaranteed to form a tree? 



Fortunately, for this last question, we can be confident about one very helpful property: G 
has no directed cycles, simply because a 'parent' is always born before its child. We ask 
then whether any acyclic digraph G with a distinguished subset X of its vertex set induces 
a natural rooted tree structure on X (described in terms of a hierarchy, i.e. a system of 
nested subsets of X) that reflects the process of populations splitting and separating 
through time. We describe several ways to define such hierarchies, and we explore their 
properties and the connections between them. 

The use of discrete mathematics to investigate possible tree-like systems of classification 
arising in evolutionary biology more systematically has been explored by a number of 
authors from different perspectives. For example, Aldous et al. (2008) recently considered 
three formal ways whereby genera could be defined in terms of species, based on a 
phylogenetic tree, obtaining an elegant characterization of these three classifications 
(Theorem 1 of Aldous et al. (2008)). A number of authors in the edited volume 
(Mirkin et al., 1997) deal with the mathematical aspects of defining hierarchies and related 
structures in biology. However, all these approaches to date have worked at a level that is 
'higher' than G. 



Our approach combines two themes developed in our earlier (independent) investigations 
into processes whereby trees arise by general connectivity considerations in two situations: 
(i) a general setting of locally connected topological spaces (Dress et al., 2009), and 
(ii) a particular metric space associated with ancestry within populations (Steel, 2007). 



The structure of the paper is as follows. We begin by introducing some further definitions, 
followed by some comments concerning a purely 'genetic' variant of the graph G. We will 
define five general ways of obtaining a collection of subsets of X from G based on notions 
of ancestry. Our main result (Theorem 1) asserts that these all lead to hierarchies (or a 
related structure, a weak hierarchy), and describes some connections between them. In the 
final section, we explore some properties of these constructions further. 



2 Notation 



Consider a finite, directed, and cycle-free graph (i.e. an acyclic digraph) 
$G = (V, E \subseteq V \times V)$ and the associated partial order $\preceq\ =\ \preceq_G$ of V defined, for all 
$u, v \in V$, by $u \preceq v$ if and only if there exists a (directed) path from u to v in G, i.e., a 
sequence $u_0 := u, u_1, \ldots, u_k := v$ of some length $k \ge 0$ of elements in V with $(u_{i-1}, u_i) \in E$ 
for all $i = 1, \ldots, k$, in which case u will also be called an ancestor of v, and v a descendant 
of u. Note that we also write $u \prec v$ in the case where $u \preceq v$ and $u \ne v$ holds. 

We will sometimes refer to the elements of V as individuals and, given any arrow $(u, v)$ in 
E, the individual u will be called a parent of v and v will be a child of u. Clearly, given any 
two elements u, v in V, we have $(u, v) \in E$ if and only if $\#\{w \in V : u \preceq w \preceq v\} = 2$ holds. 

Let X denote a distinguished subset of V, which we will regard as a set of 'observable 
individuals' in G (e.g. present-day individuals, and perhaps some fossil specimens). While 
no specific conditions need to be placed on X in what follows, it may be natural to assume 
that every v in $V - X$ has a descendant in X (implying in particular that X contains all 
elements $v \in V$ that do not (yet) have any children), as eliminating all elements from 
$V - X$ that do not have a descendant in X will not change the clusters in X we are going 
to consider below. 

For any $v \in V$, let $\vec{v}$ denote the set of individuals in X that are descendants of v 
(that is, $\vec{v} = \{x \in X : v \preceq x\}$), and for any subset U of V, put $\vec{U} := \bigcup_{v \in U} \vec{v}$. 
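
To make this notation concrete, the following Python sketch computes the sets $\vec{v}$ and $\vec{U}$ for a small digraph. The edge list, the vertex names and the helper names `x_descendants` and `x_descendants_of_set` are hypothetical choices made only for illustration; the sketch assumes the input is acyclic.

```python
from collections import defaultdict

def x_descendants(edges, X):
    """Return {v: set of x in X with v <= x}, where v <= x means that a
    directed path of length >= 0 leads from v to x. Assumes an acyclic graph."""
    X = set(X)
    children = defaultdict(set)
    vertices = set(X)
    for u, v in edges:
        children[u].add(v)
        vertices.update((u, v))
    memo = {}

    def reach(v):
        if v not in memo:
            found = {v} if v in X else set()
            for w in children.get(v, ()):
                found |= reach(w)
            memo[v] = frozenset(found)
        return memo[v]

    return {v: reach(v) for v in vertices}

def x_descendants_of_set(desc, U):
    """Union of the descendant sets of all vertices in U."""
    return frozenset().union(*(desc[v] for v in U))

# Hypothetical toy genealogy (not one of the figures in the text):
# r is a parent of p and q; p is a parent of a and b; q of c and d.
EDGES = [("r", "p"), ("r", "q"), ("p", "a"), ("p", "b"), ("q", "c"), ("q", "d")]
X = {"a", "b", "c", "d"}
desc = x_descendants(EDGES, X)
print(sorted(desc["p"]))
print(sorted(x_descendants_of_set(desc, ["p", "q"])))
```

The recursion is memoised, so each vertex is expanded once; for very deep genealogies an iterative traversal would be the safer choice.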




Figure 2: An illustrative example of an acyclic digraph G, with vertex set V and 
$X = \{a, b, c, d, e\} \subseteq V$. 



2.1 Organismal history versus genetic history 

The graph G we have defined in the introduction describes the detailed genealogical history 
of individual organisms. However it may also be of interest to consider a subgraph of this 
graph that reflects just those lines of descent that carry genetic material that survives in at 
least one of the organisms in our observed set X. Clearly it is possible for an individual 
organism that lived long ago in a diploid population to have many descendants today, and 
yet have no surviving genetic material (gene, homologous nucleotide, etc) today due to the 
processes of population genetics (a gene is inherited from one parent, not both). This 
distinction between genetic ancestry and organismic ancestry has been noted by many 
authors over the years, and has been discussed recently by Baum (2009), and, more 
theoretically, by Matsen and Evans (2008). 



We can formalize this distinction as follows: let us say that an arc $(u, v)$ of G is 
(genetically) trivial if none of the genome of v that is inherited from u is present in any of 
the descendants of v in X. Let $G_g$ be the graph obtained from G by deleting all the 
genetically trivial arcs. Thus, in $G_g$ we only retain those parent-child arcs for which the 
child inherits from that parent genetic material that survives in at least one of the observed 
individuals. 

Many of our results (including our main result, Theorem 1) remain true for both types of 
graphs, since they are stated in the generality of a finite, directed, cycle-free graph that 
contains X within its vertex set, and $G_g$ clearly inherits these properties from G. However, 
some examples (e.g. the example of a tight cluster involving humans), and some discussion, 
depend more crucially on which type of graph we are considering, and so, for the sake of 
simplicity, we will regard G as the genealogical rather than ancestral genetic graph from 
now on. 



2.2 Hierarchies and weak hierarchies 

We say that a collection $\mathcal{H}$ of subsets of X forms a (generalized) hierarchy on X if $\mathcal{H}$ 
satisfies the nesting property: 

$A, B \in \mathcal{H} \Rightarrow A \cap B \in \{\emptyset, A, B\}$. 

Note that a collection satisfying this condition is also referred to in the hypergraph literature as a laminar family, 
and the word 'hierarchy' often also requires further conditions such as $X \in \mathcal{H}$, $\emptyset \notin \mathcal{H}$, or 
$\{x\} \in \mathcal{H}$ for all $x \in X$. Here, however, we will insist on the nesting property only. 

A natural bijection exists between (isomorphism classes of) rooted X-forests and 
hierarchies on X that do not contain the empty set (see, for example, Edmonds and Giles 
(1977), Section 8), which restricts to a bijection between the set of (isomorphism classes of) 
rooted X-trees and the set of hierarchies on X that contain X but not $\emptyset$. In particular, 
$|\mathcal{H}| \le 2|X|$ holds for every hierarchy $\mathcal{H}$ (maximal hierarchies are considered further by 
Böcker and Dress (2000)). Note also that if $\mathcal{H}$ is a hierarchy on X, then so is any subset of 
$\mathcal{H}$, and also that, for any set $Y \subseteq X$, the collection $\{A \cap Y : A \in \mathcal{H}\}$ is a hierarchy on Y. 
Given any collection $\mathcal{P}$ of subsets of X, there is a simple way to define an associated 
hierarchy $\mathcal{H}_{\mathcal{P}}$ by setting: 

(1) $\mathcal{H}_{\mathcal{P}} := \{C \in \mathcal{P} : \forall C' \in \mathcal{P},\ C \cap C' \in \{C, C', \emptyset\}\}$. 
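
The nesting property and the construction (1) translate directly into a few lines of code. The following Python sketch is only an illustration; the function names `is_hierarchy` and `hierarchy_of` and the example collection `P` are hypothetical.

```python
from itertools import combinations

def is_hierarchy(family):
    """Nesting property: any two members are disjoint or nested."""
    sets = [frozenset(s) for s in family]
    return all((a & b) in (frozenset(), a, b) for a, b in combinations(sets, 2))

def hierarchy_of(family):
    """Construction (1): keep exactly those members C of the collection that
    are compatible (disjoint or nested) with every member of the collection."""
    sets = {frozenset(s) for s in family}
    return {c for c in sets
            if all((c & d) in (c, d, frozenset()) for d in sets)}

# Hypothetical collection of subsets of X = {a, b, c, d}.
P = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]
H_P = hierarchy_of(P)
print(is_hierarchy(P))    # False: {a, b} and {b, c} overlap without nesting
print(is_hierarchy(H_P))  # True
```

In this example the two overlapping sets {a, b} and {b, c} are both discarded, while {a, b, c} and {d} survive.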



A weaker condition than that satisfied by a hierarchy is the condition: 

$A, B, C \in \mathcal{H} \Rightarrow A \cap B \cap C \in \{A \cap B, B \cap C, A \cap C\}$. 

If $\mathcal{H}$ satisfies this condition, it is said to form a weak hierarchy. Weak hierarchies share 
some properties with 'proper' hierarchies (for example, clusters can be identified using at 
most two elements from X), and these are explored further by Bandelt and Dress (1989); 
moreover, as with a hierarchy, there is a polynomial bound on the size of a weak hierarchy 
in terms of $|X|$: We have $|\mathcal{H}| \le \binom{|X|+1}{2}$ for any weak hierarchy that does not contain the 
empty set. 
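
The weak hierarchy condition can be checked in the same brute-force manner as the nesting property. The sketch below is illustrative only; `is_weak_hierarchy`, `within_size_bound` and the two example families are hypothetical, and the size bound is the one quoted above for families without the empty set.

```python
from itertools import combinations

def is_weak_hierarchy(family):
    """For any three members, the triple intersection equals one of the
    three pairwise intersections."""
    sets = [frozenset(s) for s in family]
    return all((a & b & c) in (a & b, b & c, a & c)
               for a, b, c in combinations(sets, 3))

def within_size_bound(family, X):
    """Check |H| <= (|X| + 1 choose 2), ignoring the empty set."""
    members = {frozenset(s) for s in family} - {frozenset()}
    n = len(set(X))
    return len(members) <= n * (n + 1) // 2

# Overlapping 'intervals' on a, b, c, d: not nested, but a weak hierarchy.
chain = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
# Three pairwise-overlapping sets with empty common intersection: not weak.
triangle = [{"a", "b"}, {"b", "c"}, {"a", "c"}]
print(is_weak_hierarchy(chain), is_weak_hierarchy(triangle))
```

The `chain` family shows that a weak hierarchy may contain properly overlapping clusters, which a hierarchy cannot.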



2.3 Connectivity through evolution 



Evolution suggests that all organisms we can observe today descended from a small group 
of common ancestors and this suggests that the graph G is connected in various possible 
ways. These are summarized by the following, increasingly liberal connectivity 
requirements: 

(C1) G contains a vertex v with $\vec{v} = X$. 

(C2) For all $x, y \in X$, there exists $v \in V$ with $v \preceq x, y$. 

(C3) The graph $\Gamma(X) := \left(X, \{\{x, y\} \in \binom{X}{2} : \exists v \in V : x, y \in \vec{v}\}\right)$ is connected. 

In the biological context, Condition (C1) is merely the statement that all living organisms 
today have (at least) one common ancestor some time in the past. Condition (C2) says 
that every pair of individuals in X has a common ancestor, while Condition (C3) says any 
two individuals in X are related through a chain of relatives in X. Mathematically, (C2) 
implies that $\Gamma(X)$ is a complete graph; moreover, we have (C1) $\Rightarrow$ (C2) $\Rightarrow$ (C3). Although 
(C1) is usually held to be biologically reasonable (Crick, 1968; Futuyma, 1998; 
Sober and Steel, 2002; Woese, 2000), we do not necessarily assume this condition here; the 
choice of any particular condition (C1)-(C3) is relevant only for two reasons: (i) it can 
determine whether or not X is an element of some of the hierarchies we construct and (ii) 
Condition (C2) can be helpful to ensure the existence of clusters defined by pairwise 
ancestral relationships. 
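
The three connectivity conditions can be tested mechanically once the descendant sets $\vec{v}$ are known. The following Python sketch assumes these sets are supplied as a dictionary `desc` mapping each vertex to its set of descendants in X; the function names and the toy data are hypothetical, and X is assumed to be non-empty.

```python
from itertools import combinations

def holds_c1(desc, X):
    """(C1): some vertex has all of X as its descendants."""
    X = set(X)
    return any(set(d) == X for d in desc.values())

def holds_c2(desc, X):
    """(C2): every pair in X lies in the descendant set of some vertex."""
    return all(any(x in d and y in d for d in desc.values())
               for x, y in combinations(sorted(X), 2))

def holds_c3(desc, X):
    """(C3): the graph Gamma(X), joining x and y whenever they share an
    ancestor, is connected (checked by a depth-first search)."""
    X = set(X)
    start = next(iter(X))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for d in desc.values():
            if x in d:
                for y in set(d) & X:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
    return seen == X

# Hypothetical example: p is a parent of a and b, q is a parent of b and c.
X = {"a", "b", "c"}
desc = {"p": {"a", "b"}, "q": {"b", "c"},
        "a": {"a"}, "b": {"b"}, "c": {"c"}}
print(holds_c1(desc, X), holds_c2(desc, X), holds_c3(desc, X))
```

Here only the weakest condition (C3) holds: a and c share no ancestor, but both are related to b.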



3 X-Clusters from G 



We now describe a variety of ways whereby an acyclic digraph G with $X \subseteq V$ can naturally 
give rise to specific collections of subsets of X based on concepts of ancestry. In Section 4, 
we will show how these constructions lead to (weak) hierarchies. 



3.1 Tight clusters 



We begin with an intuitively simple way to generate clusters on X from any acyclic 
digraph G = (V, E) with $X \subseteq V$. Although the conditions a cluster must satisfy in this 
first definition are more severe than those we consider later, we will describe in the remark 
below how results in population genetics provide some justification for the existence of such 
tightly-constrained clusters. 

For a non-empty subset C of X, let $D(\supseteq C)$ denote the set of all individuals $v \in V$ whose 
descendants contain every individual in C, let $D(\subseteq C)$ denote the set of individuals in V 
all of whose descendants in X are contained in C, and let $D(=C) := D(\supseteq C) \cap D(\subseteq C)$ 
denote the set of all individuals in V whose descendants in X coincide exactly with C. 
That is, we put: 

$D(\supseteq C) := \{v \in V : \vec{v} \supseteq C\}$, $D(\subseteq C) := \{v \in V : \vec{v} \subseteq C\}$, 

and put: 

$D(=C) := \{v \in V : \vec{v} = C\}$. 

So, $D(=C)$ consists of all individuals in V that are ancestral exactly to every element in C, 
but no other elements in X. 

We define a subset C of X to be a tight cluster (in X relative to G) if and only if it is 
non-empty and $D(=C)$ separates C from $X - C$, that is, every (undirected) path from an 
element in C to an element in $X - C$ contains some element from $D(=C)$. 

Note that for any non-empty subset C of X and any non-empty subset $V'$ of $D(\supseteq C)$, we 
have $C \subseteq \bigcap_{v \in V'} \vec{v} \subseteq \vec{V'} = \bigcup_{v \in V'} \vec{v}$ and, hence, 

(2) $\vec{V'} = C \iff V' \subseteq D(=C)$. 

Clearly, a subset C of X is a tight cluster if and only if at least one subset $V' = V_C$ of $D(=C)$ 
separates C from $X - C$. 
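
The definition of a tight cluster can be checked by deleting $D(=C)$ from the underlying undirected graph and searching for a surviving path from C to $X - C$. The Python sketch below follows this idea; the function names and the two toy graphs are hypothetical and are not the graphs of the figures.

```python
from collections import defaultdict, deque

def x_descendants(edges, X):
    """{v: frozenset of x in X reachable from v by a directed path (length >= 0)}."""
    X = set(X)
    children, vertices = defaultdict(set), set(X)
    for u, v in edges:
        children[u].add(v)
        vertices.update((u, v))
    memo = {}

    def reach(v):
        if v not in memo:
            found = {v} if v in X else set()
            for w in children.get(v, ()):
                found |= reach(w)
            memo[v] = frozenset(found)
        return memo[v]

    return {v: reach(v) for v in vertices}

def is_tight_cluster(edges, X, C):
    """C is tight if it is a non-empty subset of X and every undirected path
    from C to X - C contains a vertex of D(=C)."""
    X, C = set(X), set(C)
    if not C or not C <= X:
        return False
    desc = x_descendants(edges, X)
    d_eq = {v for v, d in desc.items() if d == C}   # the set D(=C)
    nbrs = defaultdict(set)
    for u, v in edges:                               # forget arc directions
        nbrs[u].add(v)
        nbrs[v].add(u)
    # A path starting at an element of C that lies in D(=C) is blocked at once,
    # so the search only starts from the remaining elements of C.
    start = C - d_eq
    seen, queue = set(start), deque(start)
    while queue:
        u = queue.popleft()
        if u in X - C:
            return False                             # escaped without meeting D(=C)
        for w in nbrs[u]:
            if w not in d_eq and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

X = {"a", "b", "c", "d"}
TREE = [("r", "p"), ("r", "q"), ("p", "a"), ("p", "b"), ("q", "c"), ("q", "d")]
HYBRID = TREE + [("q", "b")]   # b acquires an additional 'parent' q
print(is_tight_cluster(TREE, X, {"a", "b"}), is_tight_cluster(HYBRID, X, {"a", "b"}))
```

In the tree-like example {a, b} is tight, whereas the single extra arc in `HYBRID` opens a path from b to c that avoids $D(=\{a, b\})$.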

As an example, the non-singleton tight X-clusters of the graph G shown in Fig. 2 are 
$\{a, b\}$ and X, as $D(=\{a, b\}) = \{v_1, v_2, v\}$ holds where v is the left-hand parent of $v_1$ and 
$v_2$, and this set clearly separates $\{a, b\}$ from $\{c, d, e\}$; yet the subset $\{v_1, v_2\}$ of $D(=\{a, b\})$ 
also separates $\{a, b\}$ from $\{c, d, e\}$. 

Notice that X itself is a tight cluster if and only if the strongest connectivity condition (C1) 
holds. Notice also that the set of tight X-clusters of G is always a subset of the hierarchy 
$\mathcal{H}_{\mathcal{P}}$ defined in (1) for $\mathcal{P} = \{\vec{v} : v \in V\}$, though, in general, the latter set can be strictly 
larger than the set of tight X-clusters of G. 

The concept of a tight cluster is a relaxation of the notion of 'organismic exclusivity' 
described recently by Baum (2009), which requires that there is an element in $D(=C)$ that 
separates C from $X - C$. 



3.2 An example of a tight cluster in recent evolution 



The conditions for a tight cluster are strong. However, results in population genetics 
suggest that for diploid (sexually-reproducing) populations, they may sometimes be 
reasonable. This is because, under a neutral model of random diploid mating, Chang 
(1999) showed that if we trace back the ancestry of a set of n extant individuals by (at 
least) $1.77 \log_2(n)$ generations, the population extant at this earlier time is likely to have 
the property that each individual in this ancestral population either has no extant 
descendants, or has all n extant individuals as descendants. This sharp $\log_2(n)$ behavior 
was shown to extend to more realistic models of human mating behaviour, including 
migration, at the price of a constant larger than 1.77 by Rohde et al. (2004). 
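
To give a feeling for how slowly the quantity $1.77 \log_2(n)$ grows, the following Python sketch evaluates it for a few values of n. The function name `chang_generations` and the chosen values of n are illustrative assumptions only, and the figures apply to the idealised neutral model rather than to any real population.

```python
import math

def chang_generations(n, constant=1.77):
    """Evaluate constant * log2(n), the number of generations quoted above
    for the neutral random-mating model (n = number of extant individuals)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return constant * math.log2(n)

for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, round(chang_generations(n), 1))
```

Even for a billion individuals the value stays well below one hundred generations, which is the sense in which the behavior is 'sharp'.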



The significance of this finding can be illustrated by considering, for example, the entire 
extant human population $P_{hom}$ as a subset of the set X of all extant organisms on earth 
today. The work of Rohde et al. (2004), along with recent evidence that the radiation of 
modern humans from Africa occurred within the last 150,000 years (Liu et al., 2006), 
suggests that - excluding the existence of a Homo erectus type Yeti or Bigfoot - every 
individual v in the population $V_{hom}$ that was (i) ancestral to $P_{hom}$ and (ii) living (say) 
200,000 years ago, satisfies either $\vec{v} \cap P_{hom} = P_{hom}$ or $\vec{v} \cap P_{hom} = \emptyset$. Moreover, we can 
presumably be confident that no other non-human individual organism alive today is a 
descendant of any individual in $V_{hom}$ and, so, $V_{hom}$ would satisfy the conditions for the set 
$V_C$ mentioned above: it is tight, i.e. $\vec{V}_{hom} = P_{hom}$ holds and it separates $P_{hom}$ from all other 
currently living organisms. 

Thus, we may assume that $P_{hom}$ is, formally, a tight cluster in the set X of all extant 
organisms alive today. 

The example also underlines that, because of our specific choice of X, side lines with no 
descendants today (like, presumably, the Neanderthals) are of no direct interest in this 
context. Indeed, we may probably (that is, unless the Yeti or Bigfoot exists and belongs to 
the Homo erectus group) also take for $V_{hom}$ all individuals that were ancestral to $P_{hom}$ and 
lived 2,000,000 years ago, which, however, would not work if we choose X to denote all 
humans from the last 1,000,000 years that had no children. 

In the case of haploid reproduction, coalescence times are much longer, being of order n 
rather than $\log(n)$ (Hein et al., 2005). Nevertheless, consider a current population of n 
individuals with haploid reproduction. Suppose the ancestors of this population dating 
back as far as N generations into the past constituted a homogeneous population that was of 
approximately constant size, was genetically isolated (i.e. if there were LGT events 
involving this ancestral population then they were restricted to exchanges between 
members of that population), and left no other descendants today. Then, provided 
$N \gg n$, this current population would be a likely candidate for a tight cluster in the set 
X of all extant organisms. □ 



3.3 Strict clusters 

We now describe a second class of clusters; we will see in Theorem 1 that these include the 
tight clusters, yet they are still guaranteed to form a hierarchy. 

Define a subset C of X to be a strict X-cluster (relative to V and $\preceq$) provided that 

• $v \in V$ and $C \cap \vec{v} \ne \emptyset$ implies that either $C \subseteq \vec{v}$ or $\vec{v} \subseteq C$ - or, equivalently, 
$v \in D(\supseteq C)$ or $v \in D(\subseteq C)$ holds, and 

• the cousinship graph $\Gamma(C)$ of C is connected. 

As an example, the non-singleton strict X-clusters of the graph G shown in Fig. 2 are 
$\{a, b\}$, $\{d, e\}$ and X. 

Notice that X is a strict cluster if and only if the weakest connectivity condition (C3) holds. 

3.4 Clusters based on ancestry 

We begin this sub-section with some further definitions. 
For any pair of elements $\{a, b\}$ in V, let 

$\mathrm{ca}(a, b) := \{v \in V : v \preceq a \text{ and } v \preceq b\}$ 

be the set of common ancestors of a and b. Provided that $\mathrm{ca}(a, b)$ is non-empty, let 
$\mathrm{mrca}(a, b)$ be the set of maximal elements in $\mathrm{ca}(a, b)$; this is often referred to as the set of the 
most recent common ancestors of a and b. For $a, b, c \in X$, let us write $ab\|c$ if $\mathrm{ca}(a, b)$ is 
non-empty, and for each $v \in \mathrm{mrca}(a, b)$ there exist $v' \in \mathrm{mrca}(a, c)$ and $v'' \in \mathrm{mrca}(b, c)$ such 
that $v' \prec v$ and $v'' \prec v$ hold. 

As an example, for the graph G in Fig. 2, we have $ab\|x$ for each $x \in \{c, d, e\}$, and we have 
$de\|y$ precisely when $y \in \{a, b\}$. 



We will write $ab|c$ under the strictly weaker condition that $\mathrm{ca}(a, b)$ is non-empty, and there 
exists, for each $v \in \mathrm{mrca}(a, b)$, some $v' \in \mathrm{mrca}(a, c) \cup \mathrm{mrca}(b, c)$ with $v' \prec v$. 

A dual notion to the ancestral relation $\|$ is the following: For $a, b, c \in X$, let us write 
$ab \perp c$ if $\mathrm{ca}(a, c)$ and $\mathrm{ca}(b, c)$ are both non-empty and there exist, for all $v \in \mathrm{mrca}(a, c)$ and 
$v' \in \mathrm{mrca}(b, c)$, some $u, u' \in \mathrm{mrca}(a, b)$ (where u need not necessarily be different from $u'$) 
such that $v \prec u$ and $v' \prec u'$ hold. Note that $\|$ is neither stronger nor weaker than $\perp$, that 
is, there are examples for which $xx' \perp y$ holds but $xx'\|y$ fails (Fig. 3(a)) and also for which 
$xx'\|y$ holds but $xx' \perp y$ fails (Fig. 3(b)). 
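
The three relations can be evaluated directly from the sets of ancestors. The following Python sketch is an illustration under stated assumptions: the function names (`double_bar` for $\|$, `single_bar` for $|$, `perp` for $\perp$) and the toy tree are hypothetical, and the graph is assumed to be acyclic.

```python
from collections import defaultdict

def ancestors(edges):
    """{v: frozenset of all u with u <= v}, including v itself."""
    parents, vertices = defaultdict(set), set()
    for u, v in edges:
        parents[v].add(u)
        vertices.update((u, v))
    memo = {}

    def up(v):
        if v not in memo:
            found = {v}
            for p in parents.get(v, ()):
                found |= up(p)
            memo[v] = frozenset(found)
        return memo[v]

    return {v: up(v) for v in vertices}

def below(anc, u, v):
    """u strictly precedes v, i.e. u is a proper ancestor of v."""
    return u != v and u in anc[v]

def mrca(anc, a, b):
    """Maximal elements of ca(a, b) (empty if there is no common ancestor)."""
    common = anc[a] & anc[b]
    return {v for v in common if not any(below(anc, v, w) for w in common)}

def double_bar(anc, a, b, c):
    m_ab, m_ac, m_bc = mrca(anc, a, b), mrca(anc, a, c), mrca(anc, b, c)
    return bool(m_ab) and all(
        any(below(anc, v1, v) for v1 in m_ac)
        and any(below(anc, v2, v) for v2 in m_bc)
        for v in m_ab)

def single_bar(anc, a, b, c):
    m_ab = mrca(anc, a, b)
    other = mrca(anc, a, c) | mrca(anc, b, c)
    return bool(m_ab) and all(any(below(anc, w, v) for w in other) for v in m_ab)

def perp(anc, a, b, c):
    m_ab, m_ac, m_bc = mrca(anc, a, b), mrca(anc, a, c), mrca(anc, b, c)
    return (bool(m_ac) and bool(m_bc)
            and all(any(below(anc, v, u) for u in m_ab) for v in m_ac | m_bc))

# Hypothetical rooted tree: r -> p, q; p -> a, b; q -> c.
anc = ancestors([("r", "p"), ("r", "q"), ("p", "a"), ("p", "b"), ("q", "c")])
print(double_bar(anc, "a", "b", "c"), double_bar(anc, "a", "c", "b"))
```

On a tree-like genealogy such as this one all three relations agree; they can differ only when some pair has several most recent common ancestors.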




Figure 3: (a) An acyclic digraph G on $X = \{x, x', y\}$ for which $\{x, x'\}$ is a tight cluster 
and a co-ancestral cluster but is not an ancestral cluster. (b) An acyclic digraph G on 
$X = \{x, x', y\}$ for which $\{x, x'\}$ is an ancestral cluster but not a co-ancestral cluster. 



The following result summarizes a basic property of these relations, and will be useful in 
the next section. 

Lemma 3.1. Suppose that G is any finite, directed, cycle-free graph, with $X \subseteq V$. Given 
three distinct elements $a, b, c \in X$: 

(i) At most one of $ab\|c$, $ac\|b$ and $bc\|a$ holds; 

(ii) At most two of $ab|c$, $ac|b$, $bc|a$ hold; 

(iii) At most one of $ab \perp c$, $ac \perp b$, $bc \perp a$ holds. 



Proof: For part (i), assume that both $ab\|c$ and $ac\|b$ hold. Let v be any element in 
$\mathrm{mrca}(a, b)$; then there exists $v' \in \mathrm{mrca}(a, c)$ with $v' \prec v$. On the other hand, there also 
exists an element $u \in \mathrm{mrca}(a, b)$ such that $u \prec v'$ in view of $ac\|b$. Therefore, we have $u \prec v$ 
and $u, v \in \mathrm{mrca}(a, b)$, a contradiction to the definition of $\mathrm{mrca}(a, b)$. The second and third 
parts follow by a similar proof by contradiction. This completes the proof of the Lemma. □ 



With these definitions, we say that C is an ancestral X-cluster (respectively relaxed 
ancestral X-cluster and co-ancestral cluster) if for all $x, x' \in C$ and $y \in X - C$, we have 
$xx'\|y$ (respectively $xx'|y$ and $xx' \perp y$). Notice that the entire set X is both an ancestral 
cluster and a co-ancestral cluster under the intermediate connectivity condition (C2). 

Note that, even for a digraph G that has a vertex $v_0$ with $\vec{v_0} = X$, there may exist a tight 
X-cluster that is not an ancestral cluster, as Fig. 3(a) shows for $C = \{x, x'\}$. In this 
example, $D(=C) = \{v_2, v_3\}$, from which it is easily seen that C is a tight cluster. Note 
that $v_2 \in \mathrm{mrca}(x, x')$ yet $v_2$ is not a descendant of any vertex in either $\mathrm{mrca}(x, y) = \{v_1\}$ or 
$\mathrm{mrca}(x', y) = \{v_1\}$. 



3.5 Clusters relative to a 'time scale' 



In this section, we exploit an additional aspect of evolution - the fact that the vertices of G 
have an associated 'date' (e.g. time when they were born) and this provides a further 
avenue to define a system of clusters. 

Suppose that, in addition to the digraph $G = (V, A)$, with $X \subseteq V$, we have a map 
$T : V \to \mathbb{R}$ that strictly preserves the partial order $\preceq$, i.e. 

$u \prec v \Rightarrow T(u) < T(v)$. 

We refer to the pair (G, T) as a valuated digraph on X. Of course, the condition that such 
a map T exists is equivalent to the condition that G has no directed cycles 
(Bang-Jensen and Gutin, 2008), but we think of T as being a specific map, where, in the 
biological context, $T(v)$ would denote the time when the individual v was born (we may 
regard the present as time 0 and so T is a map from V to the non-positive reals). 
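
The equivalence between acyclicity and the existence of some order-preserving map can be seen constructively: any topological ordering of the vertices yields such a map. The Python sketch below uses the standard-library module `graphlib` (Python 3.9 or later); the function name `valuation_from_dag` and the toy edge list are hypothetical, and the resulting map is an arbitrary valuation, not the biologically meaningful birth times.

```python
from graphlib import TopologicalSorter, CycleError

def valuation_from_dag(edges):
    """Return a map T with T(u) < T(v) for every arc (u, v), scaled so that
    the largest value is 0. Raises CycleError if the graph has a directed cycle."""
    preds = {}
    for u, v in edges:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)       # node -> set of its parents
    order = list(TopologicalSorter(preds).static_order())
    n = len(order)
    # The position in a topological order increases along every arc, hence
    # along every directed path.
    return {v: float(i - (n - 1)) for i, v in enumerate(order)}

EDGES = [("r", "p"), ("r", "q"), ("p", "a"), ("p", "b"), ("q", "c")]
T = valuation_from_dag(EDGES)
print(all(T[u] < T[v] for u, v in EDGES))  # True
```

Conversely, a directed cycle makes `static_order` raise `CycleError`, reflecting the fact that no strictly order-preserving map can exist in that case.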



Following Steel (2007) we say that $C \subseteq X$ is an Apresjan X-cluster relative to T if there 
exists $t \in \mathbb{R}$ such that: 

(T1) For all $x, y \in C$, there exists $v \in V$ with $v \preceq x, y$ and $T(v) > t$; and 

(T2) For all $x \in C$, $y \in X - C$, if $v \in V$ satisfies $v \preceq x, y$ then $T(v) < t$. 



In words, C is an Apresjan X-cluster relative to T if every two individuals in C have at 
least one common ancestor after time t, but each individual in C and each individual in 
$X - C$ have all their common ancestors earlier than t. 

We say that $C \subseteq X$ is a strong Apresjan X-cluster relative to T if (T1) is strengthened to: 



(T1') For all $x, y \in C$, and every $v \in \mathrm{mrca}(x, y)$, $T(v) > t$. 



Thus, C is a strong Apresjan X-cluster relative to T if every two individuals in C have all 
their most recent common ancestors after time t, but any individual in C and individual in 
$X - C$ have all their common ancestors earlier than t. 
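
Conditions (T1) and (T2) can be tested without searching over t: a suitable t exists exactly when the latest common-ancestor time between C and $X - C$ is strictly smaller than the smallest, over pairs in C, of the latest common-ancestor time of the pair. The Python sketch below implements this check for the (non-strong) Apresjan condition; the function names, the toy graph and the dates in `T` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

def ancestors(edges, vertices):
    """{v: frozenset of all u with u <= v}, including v itself."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    memo = {}

    def up(v):
        if v not in memo:
            found = {v}
            for p in parents.get(v, ()):
                found |= up(p)
            memo[v] = frozenset(found)
        return memo[v]

    return {v: up(v) for v in vertices}

def is_apresjan_cluster(edges, T, X, C):
    """True if some real t satisfies (T1) and (T2) for C."""
    X, C = set(X), set(C)
    anc = ancestors(edges, X)
    # (T1): every pair in C (including x = y) needs a common ancestor later than t.
    upper = float("inf")
    for x, y in combinations_with_replacement(sorted(C), 2):
        common = anc[x] & anc[y]
        if not common:
            return False
        upper = min(upper, max(T[v] for v in common))
    # (T2): every common ancestor of x in C and y outside C must be earlier than t.
    lower = float("-inf")
    for x in C:
        for y in X - C:
            common = anc[x] & anc[y]
            if common:
                lower = max(lower, max(T[v] for v in common))
    return lower < upper   # a t with lower < t < upper exists iff lower < upper

EDGES = [("r", "p"), ("r", "q"), ("p", "a"), ("p", "b"), ("q", "c")]
T = {"r": -3.0, "p": -2.0, "q": -2.0, "a": 0.0, "b": 0.0, "c": 0.0}
X = {"a", "b", "c"}
print(is_apresjan_cluster(EDGES, T, X, {"a", "b"}),
      is_apresjan_cluster(EDGES, T, X, {"a", "c"}))
```

For {a, b} any t strictly between -3 and -2 works, whereas for {a, c} the only common ancestor r is older than the ancestor p that a shares with b.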



4 Main result 



We have described a variety of ways to construct a set of X-clusters from G. We now show 
that they all lead to hierarchies (in one case a weak hierarchy), and describe some 
relationships between them, in the following main result of this paper, the proof of which is 
given in the Appendix. 

Theorem 1. Suppose that G is any finite, directed, cycle-free graph, with $X \subseteq V$. 

1. The following sets form a hierarchy: 

(a) The set of tight X-clusters of G; 

(b) The set of strict X-clusters of G; 

(c) The set of ancestral X-clusters of G; 

(d) The set of co-ancestral X-clusters of G. 

2. The set of relaxed ancestral X-clusters of G forms a weak hierarchy. 

3. Suppose that (G, T) is a valuated digraph on X. Then the set of Apresjan X-clusters 
relative to T forms a hierarchy (as does the subset of strong Apresjan X-clusters 
relative to T). 

4. Every tight X-cluster C of G is also a strict X-cluster and, under connectivity 
condition (C2), a co-ancestral cluster. If G has a valuation map T, C is also an 
Apresjan X-cluster relative to T. 



5 Discussion 



Our paper is motivated partly as a response to a currently promoted viewpoint that 
processes of reticulate evolution, such as extensive LGT, imply that no sensible or 
well-defined 'tree of life' can be constructed (Doolittle, 1999; Kunin et al., 2005; Lawton, 
2009). However, this statement depends on how one views such a tree, and where the 
transfer events occurred in it. For example, even if each gene has been transferred once 
during its history (Dagan and Martin, 2007), provided that these transfer events all 
occurred before the separation of certain populations then we may still expect to find 
Apresjan or stronger (e.g. tight) clusters, which will therefore form a tree. Consider, for 
example, the collection C of all extant mammals. The most recent common ancestors of 
mammals most likely occurred within the last 120 million years (Eizirik et al., 2001). Thus 
if those genes that are found in mammals and which underwent a gene transfer event some 
time in their past did so at a much earlier stage of evolution (i.e. well before 120 million 
years ago) then the concept of a 'mammal tree' composed of clusters of a type described 
above seems reasonable. 

Neither are recent LGT events necessarily problematic. In particular, such events will not 
destroy even a tight cluster C provided they occur amongst those ancestors of C that are 
descendants of D(= C). 

For prokaryotes, where a tree structure is most vigorously called into question, the concept 
of a tree is still well defined, but it may indeed be poorly resolved (depending on the type 
of cluster considered, and the extent to which an LGT event from individual x to y might be 
counted as an arc in G from x to y - for example, one could indicate all such instances or 
just those for which the gene transfer survives to a present copy). In cases where LGT 
(and other types of reticulate evolution) are extensive and on-going, then set systems such 
as weak hierarchies may give a more informative picture of evolution than a tree. We have 
described one way to generate such a hierarchy above, but it may be useful to explore other 
approaches. 




Figure 4: An example to illustrate a violation of sampling consistency for strict clusters. 



In this paper, we have concentrated instead on ways by which a hierarchy on X can be 
constructed from G based on concepts of ancestry and separation. Of course, the 
possibilities we have outlined are by no means exhaustive, as there will surely be other 
combinations of conditions that will allow for a hierarchy or related set system. However, 
we would like any procedure for constructing a hierarchy to have some reasonable 
biological motivation and also, if possible, to satisfy some desirable properties. One such 
desirable property is that the procedure be 'robust' with respect to the possibility that we 
have not sampled or observed all individuals in X. We can make this precise as follows. 

Suppose that $G$ is any finite, directed, cycle-free graph with $X \subseteq V$, and that $Y$ 
is a subset of $X$. Let $G|Y$ be the directed graph obtained from $G$ by regarding the 
vertices in $X - Y$ as unlabeled vertices. Now suppose we have a function $\varphi$ that 
associates to each such pair $(X, G)$ a collection of subsets of $X$. We say that $\varphi$ 
satisfies sampling consistency if it satisfies the condition: 

$C \in \varphi(X, G) \Rightarrow C \cap Y \in \varphi(Y, G|Y)$. 

We can extend this concept to valuated digraphs in the obvious way (namely, 
$C \in \varphi(X, G, T) \Rightarrow C \cap Y \in \varphi(Y, G|Y, T)$). 

It can be checked that the following constructions satisfy sampling consistency: tight 
clusters, ancestral clusters, and Apresjan clusters (with respect to a time scale). However, 
the strict cluster construction can violate this condition. For example, consider the graph 
$G$ in Fig. 4. Then $C = \{a, b, c\}$ is a strict $X$-cluster where $X = \{a, b, c, d\}$. But 
if we select $Y = \{a, c, d\}$ then $C \cap Y = \{a, c\}$ is not a strict $Y$-cluster in the 
graph $G|Y$, since the cousinship graph $\Gamma(C \cap Y)$ is not connected. 
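
A violation of this kind can be checked mechanically. The following Python sketch is 
illustrative only: the graph `ARCS` is a hypothetical example (it is not a transcription of 
Fig. 4), the function names are our own, and the test for a strict cluster is paraphrased 
from the two conditions used in the proof of Part 1(b), namely that every descendant set 
meeting C is nested with C, and that the cousinship graph on C is connected. 

```python
from itertools import combinations

def descendant_sets(arcs, labelled):
    """Map each vertex v to the labelled vertices reachable from v (v included)."""
    labelled = set(labelled)
    children, vertices = {}, set(labelled)
    for u, v in arcs:
        children.setdefault(u, set()).add(v)
        vertices.update((u, v))
    result = {}
    for v in vertices:
        seen, stack = {v}, [v]
        while stack:
            for w in children.get(stack.pop(), ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result[v] = frozenset(seen & labelled)
    return result

def is_strict_cluster(C, arcs, labelled):
    """Assumed definition: nesting with all descendant sets + connected cousinship graph."""
    C = frozenset(C)
    if not C:
        return False
    below = set(descendant_sets(arcs, labelled).values())
    # Condition 1: any descendant set that meets C contains C or is contained in C.
    if any(S & C and not (S <= C or C <= S) for S in below):
        return False
    # Condition 2: x, y in C are adjacent when some descendant set inside C holds both.
    adjacent = {x: set() for x in C}
    for S in below:
        if S <= C:
            for x, y in combinations(S, 2):
                adjacent[x].add(y)
                adjacent[y].add(x)
    start = next(iter(C))
    seen, stack = {start}, [start]
    while stack:
        for y in adjacent[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == set(C)

# Hypothetical graph: b has two parents, u1 and u2.
ARCS = [("r", "u1"), ("r", "u2"), ("r", "d"),
        ("u1", "a"), ("u1", "b"), ("u2", "b"), ("u2", "c")]
X = {"a", "b", "c", "d"}
Y = {"a", "c", "d"}          # b is treated as unlabelled in G|Y
C = {"a", "b", "c"}

print(is_strict_cluster(C, ARCS, X))       # True
print(is_strict_cluster(C & Y, ARCS, Y))   # False
```

In this example the individual b is the only link between a and c in the cousinship graph, 
so leaving b unsampled disconnects the graph on {a, c}. 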
6 Appendix: Proof of Theorem 1. 

Proof of Part 1(a): Suppose that for two tight clusters $C_1$ and $C_2$ we have 
$C_1 \cap C_2 \neq \emptyset$ and that $C_2$ is not a subset of $C_1$. We will show that 
$C_1 \subseteq C_2$. Let $V_1 = D(= C_1)$ and $V_2 = D(= C_2)$, and write $\overline{v}$ for 
the set of elements of $X$ descended from a vertex $v$. By assumption, there exist 
$x \in C_1 \cap C_2$ and $y \in C_2 - C_1$. First observe that if $V_2 \subseteq V_1$ then 
$C_2 \subseteq C_1$, in violation of our assumption. Thus, there exists $v \in V_2 - V_1$. 
Now, since $x, y \in C_2$ and $v \in V_2$, there exists a directed path from $v$ to $x$ and a 
directed path from $v$ to $y$. In particular these provide an undirected path $P$ in $G$ 
connecting $x$ and $y$. But now, since $x \in C_1$ while $y \in X - C_1$, and since $C_1$ is 
tight (so $V_1$ separates $C_1$ from $X - C_1$), at least one vertex, say $w$, in $P$ must lie 
in $V_1$. Regardless of where $w$ lies on $P$ we have $v \preceq w$ (since every vertex $v'$ 
on $P$ satisfies $v \preceq v'$) and so $\overline{w} \subseteq \overline{v}$. Therefore, since 
$\overline{w} = C_1$ and $\overline{v} = C_2$, we have $C_1 \subseteq C_2$, as required. This 
completes the proof of Part 1(a). 
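
The separation argument above can be explored on a small example. The Python sketch below 
is illustrative only: the graph `ARCS` is hypothetical, the function names are our own, and 
tightness is tested as it is used in this proof (removing D(= C) must disconnect C from 
X - C in the underlying undirected graph), with the added assumption that D(= C) is 
non-empty. It enumerates the tight clusters by brute force and confirms that they are 
pairwise nested or disjoint. 

```python
from itertools import combinations

def descendant_sets(arcs, X):
    """Map each vertex v to the elements of X reachable from v (v included)."""
    X = set(X)
    children, vertices = {}, set(X)
    for u, v in arcs:
        children.setdefault(u, set()).add(v)
        vertices.update((u, v))
    result = {}
    for v in vertices:
        seen, stack = {v}, [v]
        while stack:
            for w in children.get(stack.pop(), ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result[v] = frozenset(seen & X)
    return result

def is_tight(C, arcs, X):
    """Assumed test: D(=C) is non-empty and separates C from X - C."""
    C = frozenset(C)
    V_C = {v for v, S in descendant_sets(arcs, X).items() if S == C}
    if not C or not V_C:
        return False
    # Underlying undirected graph with the vertices of D(=C) deleted.
    nbrs = {}
    for u, v in arcs:
        if u not in V_C and v not in V_C:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    outside = set(X) - C
    for x in C - V_C:
        seen, stack = {x}, [x]
        while stack:
            for w in nbrs.get(stack.pop(), ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if seen & outside:      # some path from x avoids D(=C) and leaves C
            return False
    return True

def tight_clusters(arcs, X):
    """Brute force over all non-empty subsets of X (small examples only)."""
    X = sorted(X)
    return [frozenset(S) for k in range(1, len(X) + 1)
            for S in combinations(X, k) if is_tight(S, arcs, X)]

def is_hierarchy(clusters):
    """True if every pair of clusters is nested or disjoint."""
    return all(A <= B or B <= A or not (A & B)
               for A, B in combinations(clusters, 2))

# Hypothetical graph: t has two parents (q and s), a reticulation below q.
ARCS = [("r", "p"), ("r", "q"), ("p", "a"), ("p", "b"), ("q", "c"), ("q", "s"),
        ("s", "d"), ("s", "e"), ("q", "t"), ("s", "t"), ("t", "f")]
X = set("abcdef")
CLUSTERS = tight_clusters(ARCS, X)

print(frozenset("ab") in CLUSTERS)    # True
print(frozenset("def") in CLUSTERS)   # False: the arc q -> t bypasses s
print(is_hierarchy(CLUSTERS))         # True
```

Here {d, e, f} is the descendant set of s but is not tight, because f can still be reached 
from c through q and t once s is deleted. 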

To establish Part 1(b), suppose that $C, C'$ are strict $X$-clusters, and that $C \cap C'$ and 
$C - C'$ are both non-empty. We will show $C' \subseteq C$. Take $x \in C \cap C'$, 
$y \in C - C'$. By the connectivity of the cousinship graph $\Gamma(C)$ there is a path in 
this graph from $x$ to $y$, say $x = x_1, x_2, \ldots, x_k = y$. Let $x_i, x_{i+1}$ be the first 
pair of adjacent vertices in this path for which $x_i \in C \cap C'$ and 
$x_{i+1} \in C - C'$. Since $x_i$ and $x_{i+1}$ are adjacent there is a vertex $v \in V$ for 
which $x_i, x_{i+1} \in \overline{v} \subseteq C$. Moreover, we have 
$\overline{v} \cap C' \neq \emptyset$ (since $x_i \in C' \cap \overline{v}$) and so the first 
condition in the definition of a strict cluster implies that either 
$C' \subseteq \overline{v}$ or $\overline{v} \subseteq C'$. But the second of these two 
inclusions is impossible, since $x_{i+1} \in \overline{v} - C'$. Thus 
$C' \subseteq \overline{v}$ and since $\overline{v} \subseteq C$, this implies that 
$C' \subseteq C$, as required to establish Part 1(b). 

For Part 1(c), assume, for the sake of contradiction, that $C, C'$ are ancestral clusters, and 
there exist three elements $a, b, c$ with $a \in C - C'$, $b \in C' - C$ and 
$c \in C \cap C'$. Then, by definition, we have $ac \| b$ and $bc \| a$, a contradiction to 
Lemma 3.1(i). A similar argument applies for Part 1(d). This completes the proof of Part 1. 



Proof of Part 2: Suppose that $A, B, C$ are three relaxed ancestral clusters which violate the 
condition $A \cap B \cap C \in \{A \cap B, A \cap C, B \cap C\}$. Then we can select 
$x \in A \cap B - C$, $y \in A \cap C - B$, $z \in B \cap C - A$. We have $xy|z$ (since $x$ 
and $y$ but not $z$ are in $A$), and $xz|y$ (since $x$ and $z$ but not $y$ are in $B$), and 
$yz|x$ (since $y$ and $z$ but not $x$ are in $C$), in violation of Lemma 3.1(ii). 
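
The three-set condition used in this proof is easy to test directly on a finite set system. 
The following Python sketch is illustrative only; the function name `is_weak_hierarchy` is 
our own, and the two small set systems are hypothetical examples. 

```python
from itertools import combinations

def is_weak_hierarchy(sets):
    """Check that A & B & C is one of A & B, A & C, B & C for every triple of sets."""
    sets = [frozenset(S) for S in sets]
    for A, B, C in combinations(sets, 3):
        if A & B & C not in (A & B, A & C, B & C):
            return False
    return True

# The pattern from the proof: x in A and B only, y in A and C only, z in B and C only.
violating = [{"x", "y"}, {"x", "z"}, {"y", "z"}]
print(is_weak_hierarchy(violating))   # False

# Overlapping clusters are allowed as long as no such triple exists.
overlapping = [{"x", "y"}, {"y", "z"}, {"x", "y", "z"}]
print(is_weak_hierarchy(overlapping))  # True
```

The second example is not a hierarchy, since {x, y} and {y, z} overlap without being 
nested, yet it still satisfies the weaker three-set condition. 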



Proof of Part 3: This result is from Steel (2007), based on earlier related results from 
(Apresjan, 1966; Bryant and Berry, 2001; Devauchelle et al., 2004). Since the proof is 
short, we provide it here for completeness. Suppose $C_1, C_2$ are Apresjan $X$-clusters 
relative to $T$ and there exist $x \in C_1 \cap C_2$, $y \in C_1 - C_2$, $z \in C_2 - C_1$; we 
will show that this leads to a contradiction. For $i \in \{1, 2\}$, let $t_i$ be a value of $t$ 
for which (T1), (T2) applies for $C = C_i$. If $t_1 \geq t_2$ then, by condition (T1) on 
$C_1$, there exists $v$ with $v \preceq x, y$ and $T(v) \geq t_1 \geq t_2$. But applying (T2) 
to $C_2$ gives $T(v) < t_2$ (since $x \in C_2$ and $y \in X - C_2$), a contradiction. A similar 
argument applies if $t_1 < t_2$. 

Proof of Part 4: Suppose that $C$ is a tight $X$-cluster. We first show that $C$ is a strict 
$X$-cluster. Select any $w \in D(= C)$. Then $\overline{w} = C$, and so the cousinship graph 
$\Gamma(C)$ is a clique (and hence a connected graph). Now, suppose that 
$C \cap \overline{v} \neq \emptyset$, and that $\overline{v}$ is not a subset of $C$. We will 
show that $C \subseteq \overline{v}$. Select $x \in C \cap \overline{v}$, 
$y \in \overline{v} - C$. There exists a directed path in $G$ from $v$ to $x$ and a directed 
path from $v$ to $y$. In particular, these provide an undirected path $P$ in $G$ connecting 
$x$ and $y$. Since $x \in C$ but $y$ lies outside of $C$, path $P$ must contain at least one 
vertex $v' \in D(= C)$ (since $D(= C)$ separates $C$ from $X - C$). Then $v \preceq v'$ and so 
$\overline{v'} \subseteq \overline{v}$. But $\overline{v'} = C$ (since $v' \in V_C = D(= C)$) 
so that $C \subseteq \overline{v}$, as required to establish that $C$ is a strict $X$-cluster. 

Next we show that $C$ is a co-ancestral cluster, i.e. for any $x, x' \in C$, $y \in X - C$ we 
have $xx' \perp y$. Let $v$ be a vertex in mrca$(x, y)$ (such a vertex exists by (C2)) and 
consider the (undirected) path $P$ from $x$ to $v$ to $y$. Since $x \in C$ and 
$y \in X - C$, the fact that $D(= C)$ separates $C$ from $X - C$ (because $C$ is a tight 
cluster) implies that one vertex, say $w$, in $P$ must lie in $D(= C)$. The vertex $w$ does 
not lie on the path from $v$ to $y$, otherwise we have $y \in \overline{w} = C$, so $w$ is in 
the path from $v$ to $x$. Since $x' \in \overline{w}$, it follows that $w$ is, or has as a 
descendant, a vertex in mrca$(x, x')$. A similar argument applies to any vertex in 
mrca$(x', y)$ and so $xx' \perp y$. Since this holds for all $x, x' \in C$ and 
$y \in X - C$, $C$ is a co-ancestral cluster of $X$. 



For the final claim in Part 4, suppose that $C$ is a tight $X$-cluster of $G$. We will show 
that (T1) and (T2) hold for $t = t_C$ where $t_C := \max\{T(v) : v \in D(= C)\}$. First select 
$v_0 \in V_C$ with $T(v_0) = t_C$. Observe that for all $x, x' \in C$, we have 
$v_0 \preceq x, x'$ and since $T(v_0) \geq t_C$ we see that condition (T1) is satisfied for 
$t = t_C$ and $v = v_0$. To verify condition (T2), suppose that $x \in C$, $y \in X - C$ and 
there exists $v \preceq x, y$ with $T(v) \geq t_C$. Consider the (undirected) path $P$ in $G$ 
from $x$ to $v$ and then to $y$. If $v \in V_C$ then $\overline{v} = C$, which is impossible 
since $v \preceq y$ implies $y \in \overline{v}$, yet $y$ is not an element of $C$. Thus $v$ 
is not an element of $V_C$. Moreover, for any vertex $w$ in $P$ that is different from $v$, we 
have $T(w) > t_C$ (since $v \prec w$ and $T$ is strictly monotone) and so $w$ is also not an 
element of $V_C$ (since all the vertices $w'$ in $V_C$ satisfy $T(w') \leq t_C$). In summary, 
none of the vertices in $P$ belongs to $V_C$, thus deleting $V_C$ fails to disconnect $x$ 
from $y$, violating the assumption that $D(= C)$ separates $C$ from $X - C$. This establishes 
property (T2), as required, and thereby completes the proof. □ 



